Whole genome sequencing of a novel sea anemone (Actinostola sp.) from a deep-sea hydrothermal vent

Deep-sea hydrothermal vents are usually considered extreme environments, with high pressure, high temperature, scarce food, and chemical toxicity, yet many local inhabitants have evolved special adaptive mechanisms for living in this representative ecosystem. In this study, we constructed a high-quality genome assembly for a novel deep-sea anemone species (Actinostola sp.) collected at a depth of 2,971 m in an Edmond vent along the central Indian Ocean ridge. The assembly has a total size of 424.3 Mb and a scaffold N50 of 383 kb, and it contains 265 Mb of repetitive sequences and 20,812 protein-coding genes. Taken together, our reference genome provides a valuable genetic resource for exploring the evolution and adaptive clues of this deep-sea anemone.

In the repeat annotation, a total of 265 Mb, covering 62.4% of the assembled genome, was predicted to be repetitive sequence. Of this, 25.5% of the genome (108.2 Mb) consisted of DNA repeat elements, 8.4% (35.6 Mb) of long interspersed nuclear elements (LINE), 14.3% (60.6 Mb) of long terminal repeats (LTR), and 0.8% (3.6 Mb) of short interspersed nuclear elements (SINE). After masking these repetitive regions, we applied an integrated method of homologous sequence search and de novo gene prediction to annotate 20,812 protein-coding genes in the assembled genome. By searching four public databases, including GO (Gene Ontology) 8, KEGG (Kyoto Encyclopedia of Genes and Genomes) 9, SwissProt 10 and TrEMBL 11, we found that 97.89% (19,111 in total) of these predicted genes were functionally annotated.
The coding sequences (CDS) predicted from the assembled genomes of Actinostola sp. (this study) and seven other representative species (Fig. 1c) were used for clustering of gene families. The 20,812 protein-coding genes of Actinostola sp. were clustered into 10,327 gene families, of which 3,526 were single-copy orthologous families. A phylogenetic tree (Fig. 1c) was constructed from these single-copy orthologous gene families with the maximum likelihood method, predicting that our newfound Actinostola sp. diverged from another sea anemone, Exaiptasia diaphana, 305 million years ago (Mya). This high-quality reference genome for Actinostola sp. can also provide novel insights for enhancing wild resource conservation, discovering new functional genes, developing novel marine drugs, and elucidating special adaptive mechanisms.
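As an illustration of how single-copy orthologous families can be picked out of a clustering result, the following minimal sketch keeps only the families in which every species contributes exactly one gene. The data layout (a dictionary mapping family ID to per-species gene lists) and all names are assumptions made for this example; the clustering tool used in this study is not specified here, and its actual output format may differ.

```python
def single_copy_families(families, species):
    """Return IDs of families with exactly one gene in every listed species.

    families: {family_id: {species_name: [gene_id, ...]}}  (hypothetical layout)
    species:  iterable of all species names included in the clustering
    """
    wanted = set(species)
    return sorted(
        fam_id
        for fam_id, members in families.items()
        # The family must cover exactly the wanted species, one gene each.
        if set(members) == wanted
        and all(len(genes) == 1 for genes in members.values())
    )


# Toy example with two species (gene and family names are made up).
toy = {
    "fam1": {"A": ["a1"], "B": ["b1"]},        # single-copy in both
    "fam2": {"A": ["a2", "a3"], "B": ["b2"]},  # duplicated in A
    "fam3": {"A": ["a4"]},                     # missing from B
}
print(single_copy_families(toy, ["A", "B"]))
```

Only such families are normally concatenated for phylogenetic inference, because duplicated or missing genes make orthology ambiguous.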

Methods
Sample collection, library construction, and genome sequencing. A specimen of Actinostola sp. was collected from an Edmond vent along the central Indian Ocean ridge for whole genome sequencing. Genomic DNA (gDNA) was extracted using a QIAwave DNA Blood & Tissue Kit (Qiagen, Germantown, MD, USA). The genome was sequenced using a combination of techniques: paired-end sequencing of a library with a 500-bp insert size on an Illumina HiSeq X Ten platform (Illumina Inc., San Diego, CA, USA), and sequencing of a PacBio library with an insert size of 20 kb on a PacBio sequencing platform (Pacific Biosciences, Menlo Park, CA, USA).

Genome size estimation.
The Illumina short reads were filtered with SOAPfilter v2.2 12. Clean reads were then used to estimate the Actinostola sp. genome size with a 17-mer frequency distribution analysis according to the following formula 13: Genome Size = K-mer_num/peak_depth, where K-mer_num is the total number of 17-mers and peak_depth denotes the estimated peak frequency of 17-mers.
Genome assembly. Before assembly, the PacBio long reads were corrected using LoRDEC 14, along with the clean Illumina short reads. After correction, DBG2OLC 15 was applied to assemble these long reads into contigs with the assistance of the clean short reads. To further improve the genome accuracy, two rounds of polishing were performed with different strategies. First, Racon v1.3.1 16 was employed for contig polishing based on the uncorrected PacBio long reads. Second, the clean short reads were used to polish the contigs with Pilon 17.
After reducing heterozygous redundancy with Redundans 18, we obtained a polished genome assembly for the sequenced Actinostola sp. BUSCO 19 v5.22 provided quantitative measurements of the completeness of this assembly with the eukaryota_odb9 database as the reference.
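The k-mer formula above can be sketched in a few lines of Python. This is a simplified illustration, not the pipeline used in this study: it counts k-mers in memory, does not collapse reverse complements, and treats the `min_depth` cutoff for discarding low-depth (likely erroneous) k-mers as an assumed parameter. Function names are hypothetical.

```python
from collections import Counter


def kmer_histogram(reads, k=17):
    """Count k-mers across reads; return {depth: number of distinct k-mers}."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return Counter(counts.values())


def estimate_genome_size(histogram, min_depth=2):
    """Genome size = total k-mers / peak depth.

    K-mers seen fewer than min_depth times are ignored (assumed cutoff),
    since they mostly arise from sequencing errors.
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    peak_depth = max(usable, key=usable.get)          # most common depth
    kmer_num = sum(d * n for d, n in usable.items())  # total k-mer occurrences
    return kmer_num / peak_depth


# Toy histogram: a main peak at depth 20 plus an error peak at depth 1.
hist = {1: 1000, 19: 30, 20: 50, 21: 30}
print(estimate_genome_size(hist))
```

In practice, dedicated k-mer counters are used for real read sets because the number of distinct 17-mers is far too large for a plain in-memory dictionary.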

Genome annotation.
We predicted repeat elements by de novo and homology annotations. RepeatModeler 20 and LTR-FINDER 21 were employed for the de novo prediction to build repeat libraries. Then, the two libraries were combined and aligned to the assembled genome with RepeatMasker 22. For the homology prediction, a known repeat library (Repbase 23) was employed to identify repeats with RepeatMasker and RepeatProteinMask 22. Tandem repeats were detected using Tandem Repeat Finder 24. Finally, by integrating the data from both methods, a nonredundant set of repeat elements was obtained.
To predict protein-coding genes, protein sequences from nine representative species, including California sea hare (Aplysia californica), nematode (Caenorhabditis elegans), sacoglossan sea slug (Elysia chlorotica), limpet (Lottia gigantea), two-spot octopus (Octopus bimaculoides), invasive apple snail (Pomacea canaliculata), glass anemone (Exaiptasia pallida), starlet sea anemone (Nematostella vectensis), and human (Homo sapiens), were downloaded from Ensembl 25 and mapped to our assembled genome with TBLASTn 26. Subsequently, gene structures were predicted by GeneWise 27. Finally, we integrated all these predicted results using MAKER 28 to obtain a consistent gene set.
For functional annotation, BLASTp 29 was applied to align the predicted protein sequences against four public databases (SwissProt 10, TrEMBL 10, KEGG 30 and InterPro 8), and the resulting hits were then used to retrieve GO 31 terms.
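The integration of de novo and homology-based repeat calls into a nonredundant set can be pictured as merging overlapping intervals on each scaffold, so that a base annotated by several tools is counted only once. The sketch below is a generic interval merge under assumed half-open coordinates `(start, end)` on a single scaffold; it is not the procedure implemented by RepeatMasker or the other tools cited, and the function names are hypothetical.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open intervals (start, end)."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(iv) for iv in merged]


def covered_bases(intervals):
    """Number of bases covered by at least one interval."""
    return sum(end - start for start, end in merge_intervals(intervals))


# Toy calls from two hypothetical sources on one scaffold.
de_novo = [(0, 10), (20, 30)]
homology = [(5, 15)]
print(merge_intervals(de_novo + homology))
print(covered_bases(de_novo + homology))
```

Summing `covered_bases` over all scaffolds and dividing by the assembly size gives a repeat fraction that is not inflated by double-counted overlaps.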

Data records
Our final assembly and annotation data have been deposited at the NCBI with accession number JAUJYZ000000000 32. Protein and gene coding sequences have been uploaded to the Figshare repository for public access 33. The raw reads from PacBio and Illumina sequencing have also been deposited at the NCBI with accession numbers SRR25988563-SRR25988567 34.

Technical validation
The genome assembly was 424.3 Mb with a scaffold N50 of 383 kb. For quantitative assessment of this genome assembly, we showed that 83.2% of the reference BUSCO genes (insecta_db9) were successfully identified in the final genome assembly version, indicating that this Actinostola sp. genome assembly is largely complete.
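For readers unfamiliar with the N50 statistic reported above, a minimal sketch of its standard definition follows: sort the scaffold lengths from longest to shortest and report the length at which the running total first reaches half of the assembly size. The function name and the toy lengths are illustrative only and are not taken from this assembly.

```python
def n50(lengths):
    """Return the N50 of a list of sequence lengths (0 for an empty list)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # First length at which the cumulative sum covers half the assembly.
        if running * 2 >= total:
            return length
    return 0


# Toy scaffold lengths: total is 30, and 10 + 8 = 18 first exceeds 15.
print(n50([10, 8, 6, 4, 2]))
```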

Fig. 1 Sampling details and comparative analyses of the deep-sea anemone. (a) Image of the sequenced Actinostola sp. (b) Genome survey. (c) Gene family analysis and divergence time of seven representative Cnidaria species.

Table 1. Summary of the genome assembly for the sequenced Actinostola sp.